Quantitative Methodology (UPF)
“Data Scientists spend up to 80% of the time on data cleaning and 20% on actual data analysis”.
Source: R for Data Science
A df is tidy if it fulfills these requirements (Wickham 2014):
Why do we need to clean data?
Load packages.
Number of TB cases documented by the WHO in Afghanistan, Brazil, and China between 1999 and 2000 (cases & population).
Table A
# A tibble: 6 × 3
country year rate
* <chr> <int> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Table B
# A tibble: 6 × 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Table C
# A tibble: 12 × 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Table D
# A tibble: 3 × 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
NOO ANOTHER
Correlates of War Intra-State War Data
# A tibble: 442 × 4
WarName WarType SideA SideB
<chr> <dbl> <chr> <chr>
1 First Caucasus 5 Russia Georgians, Dhagestania, Chechens
2 Sidon-Damascus 6 Sidon Damascus & Aleppo
3 First Two Sicilies 4 Austria -8
4 First Two Sicilies 4 Two Sicilies Liberals
5 Spanish Royalists 4 Spain Royalists
6 Sardinian Revolt 4 Austria -8
7 Sardinian Revolt 4 Sardinia Carbonari
8 Greek Independence 5 Ottoman Empire Greeks
9 Greek Independence 5 -8 United Kingdom
10 Greek Independence 5 -8 France
# … with 432 more rows
PEW Research Religious Landscape Study1
# A tibble: 18 × 6
religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agnostic 27 34 60 81 76
2 Atheist 12 27 37 52 35
3 Buddhist 27 21 30 34 33
4 Catholic 418 617 732 670 638
5 Don’t know/refused 15 14 15 11 10
6 Evangelical Prot 575 869 1064 982 881
7 Hindu 1 9 7 9 11
8 Historically Black Prot 228 244 236 238 197
9 Jehovah's Witness 20 27 24 24 21
10 Jewish 19 19 25 25 30
11 Mainline Prot 289 495 619 655 651
12 Mormon 29 40 48 51 56
13 Muslim 6 7 9 10 9
14 Orthodox 13 17 23 32 32
15 Other Christian 9 7 11 13 13
16 Other Faiths 20 33 40 46 49
17 Other World Religions 5 2 3 4 2
18 Unaffiliated 217 299 374 365 341
We change the rows and columns of the dataframe keeping the same information.
pivot_longer(df, cols, names_to, names_to)
pivot_wider(df, names_from, values_from)
separate(df, col, into, sep)
unite(df, col, ..., sep)
NA (Not Available)
Related with the state capacity to collect them (Stone2008?)
Check the World Bank
Small or underdeveloped states
Quantitative Methodology (UPF)